An Introduction to Applied Multivariate Analysis with R (Use R!) by Brian Everitt

Author: Brian Everitt [Everitt, Brian]
Language: eng
Format: mobi
Publisher: Springer
Published: 2013-04-10T14:00:00+00:00



6 Cluster Analysis

Fig. 6.2. Inter-cluster distance measures.

Fig. 6.3. Darwin’s Tree of Life.

6.3 Agglomerative hierarchical clustering


R> (dm <- dist(measure[, c("chest", "waist", "hips")]))

       1     2     3     4     5     6     7     8     9    10
 2  6.16
 3  5.66  2.45
 4  7.87  2.45  4.69
 5  4.24  5.10  3.16  7.48
 6 11.00  6.08  5.74  7.14  7.68
 7 12.04  5.92  7.00  5.00 10.05  5.10
 8  8.94  3.74  4.00  3.74  7.07  5.74  4.12
 9  7.81  3.61  2.24  5.39  4.58  3.74  5.83  3.61
10 10.10  4.47  4.69  5.10  7.35  2.24  3.32  3.74  3.00
11  7.00  8.31  6.40  9.85  5.74 11.05 12.08  8.06  7.48 10.25
12  7.35  7.07  5.48  8.25  6.00  9.95 10.25  6.16  6.40  8.83
13  7.81  8.54  7.28  9.43  7.55 12.08 11.92  7.81  8.49 10.82
14  8.31 11.18  9.64 12.45  8.66 14.70 15.30 11.18 11.05 13.75
15  7.48  6.16  4.90  7.07  6.16  9.22  9.00  4.90  5.74  7.87
16  7.07  6.00  4.24  7.35  5.10  8.54  9.11  5.10  5.00  7.48
17  7.81  7.68  6.71  8.31  7.55 11.40 10.77  6.71  7.87  9.95
18  6.71  6.08  4.58  7.28  5.39  9.27  9.49  5.39  5.66  8.06
19  9.17  5.10  4.47  5.48  7.07  6.71  5.74  2.00  4.12  5.10
20  7.68  9.43  7.68 10.82  7.00 12.41 13.19  9.11  8.83 11.53

      11    12    13    14    15    16    17    18    19
12  2.24
13  2.83  2.24
14  3.74  5.20  3.74
15  3.61  1.41  3.00  6.40
16  3.00  1.41  3.61  6.40  1.41
17  3.74  2.24  1.41  5.10  2.24  3.32
18  2.83  1.00  2.83  5.83  1.00  1.00  2.45
19  6.71  4.69  6.40  9.85  3.46  3.74  5.39  4.12
20  1.41  3.00  2.45  2.45  4.36  4.12  3.74  3.74  7.68
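Each entry of a matrix like this is, by default, the Euclidean distance between two rows of the data. As a minimal sketch, the following reproduces one entry by hand. The data frame `toy` and its values are hypothetical stand-ins chosen for easy arithmetic; they are not the `measure` data.

```r
# Hypothetical toy measurements (NOT the book's `measure` data)
toy <- data.frame(chest = c(34, 36, 38),
                  waist = c(30, 31, 30),
                  hips  = c(32, 34, 35))

# dist() defaults to Euclidean distance between rows
d <- dist(toy)

# The same distance between rows 1 and 2, computed by hand
m <- as.matrix(toy)
hand_12 <- sqrt(sum((m[1, ] - m[2, ])^2))

print(d)
print(hand_12)
```

Converting the result with `as.matrix(d)` gives the full symmetric matrix, which can be easier to index than the lower-triangular `dist` object.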

Each of the three clustering methods described earlier can be applied to the distance matrix, and the corresponding dendrogram plotted, using the hclust() function:


R> plot(cs <- hclust(dm, method = "single"))

R> plot(cc <- hclust(dm, method = "complete"))

R> plot(ca <- hclust(dm, method = "average"))

The resulting plots (for single, complete, and average linkage) are given in the upper part of Figure 6.4.
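To make the difference between the three linkage methods concrete, the sketch below computes the inter-cluster distance between two groups by hand (minimum, maximum, and mean of the pairwise distances) and compares it with the height of the final merge reported by hclust(). The points in `pts` are hypothetical and chosen so that the two groups are obvious; they are not taken from the `measure` data.

```r
# Toy points (hypothetical values): rows 1-3 form one group, rows 4-6 another
pts <- rbind(c(0, 0), c(0, 1), c(1, 0),
             c(10, 10), c(10, 11), c(11, 10))
d <- dist(pts)

# Distances between every point in the first group and every point in the second
cross <- as.matrix(d)[1:3, 4:6]

# Inter-cluster distance under each linkage rule, computed by hand
by_hand <- c(single   = min(cross),
             complete = max(cross),
             average  = mean(cross))

# Height of the last merge (the one joining the two groups) for each method
by_hclust <- sapply(names(by_hand), function(m) {
  tail(hclust(d, method = m)$height, 1)
})

print(rbind(by_hand, by_hclust))
```

Because the groups here are far apart relative to their spread, all three methods join them only at the last step, so the final height equals the hand-computed inter-cluster distance in each case.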

[Figure 6.4: top row, three dendrograms of dm (panels Single, Complete, Average; vertical axis Height); bottom row, three scatterplots of the two-group cluster labels (axes PC1 and PC2).]

Fig. 6.4. Cluster solutions for measure data. The top row gives the cluster dendrograms along with the cutoff used to derive the classes presented (in the space of the first two principal components) in the bottom row.

We now need to consider how we select specific partitions of the data (i.e., a solution with a particular number of groups) from these dendrograms. The answer is that we “cut” the dendrogram at some height and this will give a partition with a particular number of groups. How do we choose where to cut or, in other words, how do we decide on a particular number of groups that is, in some sense, optimal for the data? This is a more difficult question to answer.


One informal approach is to examine the sizes of the changes in height in the dendrogram and take a “large” change to indicate the appropriate number of clusters for the data. (More formal approaches are described in Everitt et al. 2011.) Even using this informal approach on the dendrograms in Figure 6.4, it is not easy to decide where to “cut”.
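The informal "large change in height" idea can be sketched in a few lines. The example below uses hypothetical toy points with two obvious groups, so the largest jump is easy to find; with real data such as the measure dendrograms the jumps are typically much less clear-cut. The variable names (`pts`, `jumps`, `cut_height`) are illustrative only.

```r
# Toy points (hypothetical values): two well-separated groups of three
pts <- rbind(c(0, 0), c(0, 1), c(1, 0),
             c(10, 10), c(10, 11), c(11, 10))
hc <- hclust(dist(pts), method = "complete")

# Merge heights are stored in increasing order; merge i leaves n - i clusters
n <- nrow(pts)
jumps <- diff(hc$height)

# The largest jump lies between merge i and merge i + 1,
# so cutting there keeps the n - i clusters present after merge i
i <- which.max(jumps)
k <- n - i

# Two equivalent ways to cut: by number of groups or by height
groups_k <- cutree(hc, k = k)
cut_height <- mean(hc$height[c(i, i + 1)])
groups_h <- cutree(hc, h = cut_height)

print(k)
print(groups_k)
```

Cutting by `k` is convenient when the number of groups is known in advance, while cutting by `h` corresponds directly to drawing a horizontal line across the dendrogram.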

So instead, because we know that these data consist of measurements on ten men and ten women, we will look at the two-group solutions from each method that are obtained by cutting the dendrograms at suitable heights. We can display and compare the three solutions graphically by plotting the first two principal component scores of the data, labelling each point with its cluster membership under one of the methods, using the following code:

R> body_pc <- princomp(dm, cor = TRUE)

R> xlim <- range(body_pc$scores[,1])

R> plot(body_pc$scores[,1:2], type = "n",
+       xlim = xlim, ylim = xlim)

R> lab <- cutree(cs, h = 3.8)

R> text(body_pc$scores[,1:2], labels = lab, cex = 0.6)
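The same steps can be repeated for all three linkage methods in a loop. The sketch below is one possible arrangement, not the chapter demo: it uses a hypothetical data frame `toy` rather than the `measure` data, computes the principal components from the raw variables (an assumption made here for illustration), asks cutree() for two groups via `k = 2` instead of a height, and places the panels side by side with par(mfrow = ...) rather than layout().

```r
# Hypothetical toy measurements (NOT the book's `measure` data):
# rows 1-4 and rows 5-8 form two well-separated groups
toy <- data.frame(chest = c(34, 36, 35, 33, 42, 44, 43, 41),
                  waist = c(24, 25, 27, 23, 33, 32, 35, 34),
                  hips  = c(35, 38, 36, 34, 41, 40, 43, 39))
dm_toy <- dist(toy)

# Principal components of the raw variables (assumption for this sketch)
pc <- princomp(toy, cor = TRUE)
xlim <- range(pc$scores[, 1])

# Two-group solution from each linkage method
methods <- c("single", "complete", "average")
labs <- lapply(methods, function(m) cutree(hclust(dm_toy, method = m), k = 2))
names(labs) <- methods

# One panel per method, points drawn as their cluster label
op <- par(mfrow = c(1, 3))
for (m in methods) {
  plot(pc$scores[, 1:2], type = "n", xlim = xlim, ylim = xlim, main = m)
  text(pc$scores[, 1:2], labels = labs[[m]], cex = 0.8)
}
par(op)
```

Using the same `xlim` for both axes, as in the code above it, keeps the panels on a common scale so the solutions can be compared by eye.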

The resulting plots are shown in the lower part of Figure 6.4. The plots of dendrograms and principal components scatterplots are combined into a single diagram using the layout() function (see the chapter demo for the complete R code). The plot associated with the single linkage solution immediately demonstrates one of the problems with using this method in practice: a phenomenon known as chaining, the tendency to incorporate intermediate points between clusters into an existing cluster rather than initiating a new one.
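Chaining is easy to reproduce on a small artificial example. In the sketch below, the one-dimensional values in `x` are hypothetical: each gap between neighbouring points is slightly larger than the one before, so there is no natural break in the data. Single linkage absorbs the points one at a time, whereas complete linkage splits them into two groups of similar size.

```r
# Toy one-dimensional data (hypothetical values): gaps between neighbouring
# points grow steadily (1.0, 1.1, 1.2, ...), so there is no clear break
x <- c(0, 1.0, 2.1, 3.3, 4.6, 6.0, 7.5, 9.1)
d <- dist(x)

# Two-group solutions from single and complete linkage
single_k2   <- cutree(hclust(d, method = "single"),   k = 2)
complete_k2 <- cutree(hclust(d, method = "complete"), k = 2)

# Cluster sizes: single linkage chains seven points together
# and leaves the last point on its own
print(table(single_k2))
print(table(complete_k2))
```

The single linkage solution here consists of one long chain plus a singleton, which is the pattern the text describes.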


